babies

Introduction

The Natality.csv contains information of Natality in each state. This includes the baby’s gender, birth weight, mother’s age and total number of births. The Census2015.cvs file contains the median age of the total population in each state.

R Configuration

Below is our sessionInfo().

sessionInfo(package=NULL)
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.10    stringi_1.1.2  
##  [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.1.0   digest_0.6.11  
## [13] evaluate_0.10

ETL Script

source("../01 Data/ETL.R")
## Loading required package: readr
## Warning: package 'readr' was built under R version 3.3.3
## Loading required package: plyr
## Parsed with column specification:
## cols(
##   State = col_character(),
##   State_Code = col_integer(),
##   Gender = col_character(),
##   Gender_Code = col_character(),
##   Race = col_character(),
##   Births = col_integer(),
##   Average_Birth_Weight = col_double(),
##   Average_Age_Mother = col_double()
## )
## Classes 'tbl_df', 'tbl' and 'data.frame':    404 obs. of  8 variables:
##  $ State               : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ State_Code          : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ Gender              : chr  "Female" "Female" "Female" "Female" ...
##  $ Gender_Code         : chr  "F" "F" "F" "F" ...
##  $ Race                : chr  "American Indian or Alaska Native" "Asian or Pacific Islander" "Black or African American" "White" ...
##  $ Births              : int  100 576 9116 19514 90 617 9313 20331 1179 565 ...
##  $ Average_Birth_Weight: num  3202 3112 2947 3228 3317 ...
##  $ Average_Age_Mother  : num  27 30.6 26 27.4 26.5 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 8
##   .. ..$ State               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ State_Code          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Gender              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Gender_Code         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Race                : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Births              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Average_Birth_Weight: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Average_Age_Mother  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
## Classes 'tbl_df', 'tbl' and 'data.frame':    404 obs. of  8 variables:
##  $ State               : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ State_Code          : Factor w/ 51 levels "1","10","11",..: 1 1 1 1 1 1 1 1 11 11 ...
##  $ Gender              : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Gender_Code         : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Race                : Factor w/ 4 levels "American Indian or Alaska Native",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Births              : Factor w/ 375 levels "100","10091",..: 1 306 362 121 361 317 368 127 31 302 ...
##  $ Average_Birth_Weight: Factor w/ 399 levels "2876.45","2926.51",..: 160 59 3 190 306 187 37 326 339 196 ...
##  $ Average_Age_Mother  : Factor w/ 280 levels "24.82","25.28",..: 76 243 25 99 51 241 26 97 39 137 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 8
##   .. ..$ State               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ State_Code          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Gender              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Gender_Code         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Race                : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Births              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Average_Birth_Weight: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Average_Age_Mother  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## [1] "State_Code"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "Births"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "Average_Birth_Weight"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "Average_Age_Mother"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## Classes 'tbl_df', 'tbl' and 'data.frame':    404 obs. of  8 variables:
##  $ State               : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ State_Code          : num  1 1 1 1 1 1 1 1 2 2 ...
##  $ Gender              : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Gender_Code         : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Race                : Factor w/ 4 levels "American Indian or Alaska Native",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Births              : num  100 576 9116 19514 90 ...
##  $ Average_Birth_Weight: num  3202 3112 2947 3228 3317 ...
##  $ Average_Age_Mother  : num  27 30.6 26 27.4 26.5 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 8
##   .. ..$ State               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ State_Code          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Gender              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Gender_Code         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Race                : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Births              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Average_Birth_Weight: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Average_Age_Mother  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
## CREATE TABLE 01_DataNatality (
##  Natality 
##  State varchar2(4000),
##  Gender varchar2(4000),
##  Gender_Code varchar2(4000),
##  Race varchar2(4000),
##  State_Code number(38,4),
##  Births number(38,4),
##  Average_Birth_Weight number(38,4),
##  Average_Age_Mother number(38,4)
##  );
require(readr)
require(plyr)

file_path = "../01 Data/Natality.csv"
natality <- readr::read_csv(file_path)
names(natality)

df <- natality

names(df)
str(df) # Uncomment this line and  run just the lines to here to get column types to use for getting the list of measures.

measures <- c("State_Code","Births","Average_Birth_Weight","Average_Age_Mother")

dimensions <- setdiff(names(df), measures)
dimensions

# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
str(df)

na2emptyString <- function (x) {
  x[is.na(x)] <- ""
  return(x)
}

if( length(dimensions) > 0) {
  for(d in dimensions) {
    # Change NA to the empty string.
    df[d] <- data.frame(lapply(df[d], na2emptyString))
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
  }
}

na2zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions, and change NA to 0.
if( length(measures) > 1) {
  for(m in measures) {
    print(m)
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
    df[m] <- data.frame(lapply(df[m], na2zero))
    df[m] <- lapply(df[m], function(x) as.numeric(as.character(x)))
    #df[m] <- data.frame(lapply(df[m], as.numeric(as.character))) # This is needed to turn measures back to numeric because gsub turns them into strings.
  }
}

str(df)

write.csv(df, gsub("Natality", "Natality", file_path), row.names=FALSE, na = "")

tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n Natality \n")
if( length(measures) > 0 || ! is.na(dimensions)) {
  for(d in dimensions) {
    sql <- paste(sql, paste(d, "varchar2(4000),\n"))
  }
}
if( length(measures) > 0 || ! is.na(measures)) {
  for(m in measures) {
    if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
    else sql <- paste(sql, paste(m, "number(38,4)\n"))
  }
}
sql <- paste(sql, ");")
cat(sql)

The Cleaned up data

The cleaned up data file, Natality.csv, is on conneyc’s account under dataset S17 DV Project 5. The TAs have been added as contributors and can download the file from the link provided below.

https://data.world/conneyc/s-17-dv-project-5

The Census Data

The data file containing the census data we are using, Census2015.csv, is on conneyc’s account under dataset S17 DV Project 5. The TAs have been added as contributors and can download the file from the link provided below.

https://data.world/conneyc/s-17-dv-project-5

Visualizations

Boxplots

Boxplot Tableau


This visualization is a boxplot of state vs births. It is filtered by Race and Births and has the detail of the gender of the births. It is observed that California has the largest number of births for female and males followed by Texas.

Boxplot Shiny


This is the same data visualization done in Shiny.


Histograms

Histogram Tableau


This Treemap displays the count of the average age of mothers for each age in 0.5 bins, by race. It is colored by the count, with the race and average age as text. When the mouse is hovered over a box, it tells you the count number. For Asians, it is seen that most mothers are 30.5 years old, white mothers are 28, African Americas are 26, and Natives are 26.5. Asian mothers have babies the latest while African Americans have them the earliest.

Histogram Shiny

Natives
Histogram Shiny

Asian/Pacific Islanders
Histogram Shiny

African Americans
Histogram Shiny


White These Shiny histograms show the same result as the Tableau treemap, with the addition of a vertical average line for the average average-median-age for each race. The average line is close to the highest bar for each race except African American, where the average line is about a year later than the most common average-median-age.


Scatterplots

Scatterplot Tableau


This scatterplot compares the median age and number of births of each state. Each point signifies the median age of the state and number of births of each gender for a specific race within that state. The shape of the points classify the race and the color classifies the state. A trend line was added which shows Births = -682.411*Median_Age + 35676.1, meaning as the median age of the state increases, there appears to be a decrease in the number of births.

Scatterplot Shiny


This is the same visualization done in Shiny.


Crosstabs

Crosstab + KPI Tableau


This graph shows average birth weight per state by race. It is colored by the calculated field (KPI): AVG(average birth weight)/4000. The number 4000 is an average healthy birth weight in the US. The data reflects that white babies tend to weigh more than the other races.

Crosstab + KPI Shiny1


This is the same visualization done in Shiny.


Crosstab + KPI 3 Tableau


This graph was created by joining Natality.csv with Census2015.csv through State= AreaName. It shows the total number of births for each race per state using the data from Natality.csv. The number of births for corresponding states is colored by the Median_Age column from Census2015.csv using parameters. The parameters are low: 0-34.7, medium: 34.7-39.3, and high: 39.3+. Regardless of median age, Whites have the largest number of births followed by Black or African, Asian or Pacific Islander then American Indian or Alaska Native. However, this trend is most likely a reflection of the existing demographic where the majority of the population is White. Texas is a low Median age state and has a high number of births.

Crosstab + KPI Shiny2


This is the same visualization done in Shiny.


Barcharts

Bar Chart Tableau


This graph shows average birth weight in grams for each race in each state and the reference line, but the calculated difference from average is applied. The calculated difference is the average birth weight (for that race in that state) minus the total average birth weight of all the states. It is used for color and text and displayed by the mark labels, the color of the bars, and numbers are shown when the mouse is hovered over each bar. The data reflects that White babies and American Indian/Alaskan babies tend to weigh more than the average baby for each state.

Bar Chart Shiny


This is the Shiny version. The states chosen are states in the deep south, and the trend of White and Native babies weighing more continues for these states. The average line shows the average birth weight for the state, and the calculated difference is the difference between the average birth weight in the state and the average birth weight for the specific race in the state.


High Median Age States Tableau


This graph used data from Census2015 (Median Age) and Natality (Births, Race) joined by State. It shows the sum of births in High Median Age States, which is a set of states which an average median age greater than 40. Each bar color is broken down by the number of births contributed by each race within that state. Average Median Age is put into detail. Maine has the highest median age. All these states have the largest number of births that are White, followed by Black, Asian and Native. Of these high median age states, Florida has the largest number of births.

Bar Chart Shiny2


The Shiny version is the same for this visualization.


Shiny Application

Our visualizations are published in an application on our accounts:

https://kellyjennings.shinyapps.io/finalproject/